Exploiting Emergent Schemas to Make RDF Systems More Efficient
Authors: M.-D. Pham and P. Boncz
Abstract
We build on our earlier finding that more than 95% of the triples in actual RDF triple graphs have a remarkably tabular structure, whose schema does not necessarily follow from explicit metadata such as ontologies, but which an RDF store can automatically derive by looking at the data using so-called “emergent schema” detection techniques. In this paper we investigate how computers, and in particular RDF stores, can take advantage of this emergent schema to store RDF data more compactly and to optimize and execute SPARQL queries more efficiently. To this end, we contribute techniques for efficient emergent schema aware RDF storage and new query operator algorithms for emergent schema aware scans and joins. In all, these techniques allow RDF schema processors to fully catch up with relational database techniques in terms of rich physical database design options and efficiency, without requiring a rigid upfront schema structure definition.

1 Emergent Schema Introduction

In previous work [15], we introduced emergent schemas: the finding that >95% of triples in all LOD datasets we tested, including noisy data such as WebData Commons and DBpedia, conform to a small relational tabular schema. We provided techniques to find this “emergent” schema automatically and at little computational cost, and also to give the found columns, tables, and “foreign key” relationships between them short human-readable labels. This label-finding, and in fact the whole process of emergent schema detection, exploits not only value distributions and connection patterns between the triples, but also additional clues provided by RDF ontologies and vocabularies. A significant insight from that paper is that relational and semantic practitioners give different meanings to the word “schema”.
It is thus a misfortune that these two communities are often distinguished from each other by their different attitude to this ambiguous concept of “schema”: the semantic approach supposedly requires no upfront schema (“schema-last”), whereas relational databases only work with a rigid upfront schema (“schema-first”). Semantic schemas, primarily ontologies and vocabularies, aim at modeling a knowledge universe in order to allow diverse current and future users to denote these concepts in a universally understood way in many different contexts. Relational database schemas, on the other hand, model the structure of one particular dataset (i.e., a database), and are not designed with the purpose of re-use in different contexts. Both purposes are useful: relational database systems would be easier to integrate with each other if the semantics of a table, a column and even individual primary key values (URIs) were well-defined and exchangeable. Semantic data applications would benefit from knowledge of the actual patterns of co-occurring triples in the LOD dataset one tries to query, e.g. allowing users to more easily formulate SPARQL queries with a non-empty result (an empty result often stems from using a non-occurring property in a triple pattern).

© Springer International Publishing AG 2016. P. Groth et al. (Eds.): ISWC 2016, Part I, LNCS 9981, pp. 463–479, 2016. DOI: 10.1007/978-3-319-46523-4_28. M.-D. Pham and P. Boncz.

In [15], we observed partial and mixed usage of ontology classes across LOD datasets: even if there is an ontology closely related to the data, only a small part of its class attributes actually occur as triple properties (partial use), and typically many of the occurring attributes come from different ontologies (mixed use). DBpedia on average populates <30% of the class attributes it defines [15], and each actually occurring class contains attributes imported from no fewer than 7 other ontologies on average.
This is not necessarily bad design but rather good re-use (e.g. foaf), yet it underlines the point that any single ontology class is a poor descriptor of the actual structure of the data (i.e., a “relational” schema).

Emergent schemas are helpful for human RDF users, but in this paper we investigate how RDF stores can exploit emergent schemas for efficiency. We address three important problems faced by RDF stores.

The first and foremost problem is the high execution cost resulting from the large number of self-joins that the typical SPARQL processor (based on some form of triple table storage) must perform: one join per additional triple pattern in the query. It has been noted [7] that SPARQL queries very often contain star patterns (triple patterns that share a common subject variable), and if the properties of the patterns in these stars reference attributes from the same “table”, the equivalent relational query can be solved with a table scan, not requiring any join. Our work achieves the same reduction in the number of joins for SPARQL.

The second problem we solve is the low quality of SPARQL query optimization. Query optimization complexity is exponential in the number of joins [17]. In queries with more than 12 joins or so, optimizers can no longer analyze the full search space, potentially missing the best plan. Note that SPARQL query plans typically have F times more joins than equivalent SQL plans (see footnote 1), where F is the average size of a star pattern. This leads to a 3 times larger search space. Additionally, query optimizers depend on cost models for comparing the quality of query plan candidates, and these cost models assume independence of (join) predicates. In case of star patterns on “tables”, however, the selectivity of the predicates is heavily correlated (e.g.
subjects that have an ISBN property, typically instances of the class Book, have a much higher join hit ratio with AuthoredBy triples than the independence assumption would predict), which means that the cost model is often wrong. Taken together, this causes the quality of SPARQL query optimization to be significantly lower than in SQL. Our work eliminates many joins, making query optimization exponentially easier, and eliminates the biggest source of correlations that disturb cost modeling (joins between attributes from the same table).

Footnote 1: A query of $X$ stars has $X \times F$ triple patterns, so it needs $P_1 = X \times F - 1$ joins. When each star is collapsed into one table scan, just $P_2 = X - 1$ joins remain: $P_1 / P_2 \geq F$, so the SPARQL plan needs at least $F$ times as many joins.

The third problem we address is that mission-critical applications that depend on database performance can be optimized by database administrators using a plethora of physical design options in relational systems, yet RDF system administrators lack all of this. A simple example is clustered indexes, which store a table with many attributes in the value order of one or more sort key attributes. For instance, in a data warehouse one may store sales records ordered by Region first and ProductType second, since this accelerates queries that select on a particular product or region. Note that not only the Region and ProductType properties are stored in this order, but all attributes of the sales table, which are typically retrieved together in queries (i.e. via a star pattern). A similar relational physical design optimization is table partitioning or even database cracking [9]. Until this paper, one could not even think of RDF equivalents of these, as table clustering and partitioning imply an understanding of the structure of an RDF graph.
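The clustered-index example above can be made concrete with a small Python sketch. It is only an illustration of the relational technique, not of any RDF store: the `sales` rows, the function names and the use of in-memory lists with binary search are all assumptions introduced here.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sales rows: (region, product_type, amount).
sales = [
    ("EU", "book", 12.0),
    ("US", "toy", 7.5),
    ("EU", "toy", 3.0),
    ("ASIA", "book", 9.0),
    ("US", "book", 20.0),
]

# Clustering: the complete rows, not just the key columns, are kept in the
# order of the sort key (region first, product_type second).
clustered = sorted(sales, key=lambda row: (row[0], row[1]))
keys = [(region, product) for region, product, _ in clustered]
regions = [region for region, _, _ in clustered]


def select_region(region):
    """Rows of one region form a contiguous slice found by binary search."""
    lo = bisect_left(regions, region)
    hi = bisect_right(regions, region)
    return clustered[lo:hi]


def select_region_product(region, product):
    """Rows matching both key attributes are contiguous as well."""
    lo = bisect_left(keys, (region, product))
    hi = bisect_right(keys, (region, product))
    return clustered[lo:hi]


print(select_region("EU"))
print(select_region_product("US", "book"))
```

Because whole rows are stored in key order, a selection on the leading key attributes touches one contiguous range and returns all other attributes of the matching rows without further lookups.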
Emergent schemas allow one to leave the “pile of triples” quagmire and enter structured data management territory, where advanced physical design techniques become applicable. In all, we believe our work brings RDF datastores on par with SQL stores in terms of performance, without losing any of the flexibility offered by the RDF model, and thus without introducing a need to create upfront or enforce subsequently any explicit relational schema.

2 Emergent Schema Aware RDF Storage

The original emergent schema work allows one to store and query RDF data with SQL systems, but in that case the SQL query answers account for only those “regular” triples that fit in the relational tables. In this work, our target is to answer SPARQL queries correctly over 100% of the triples, while still improving the efficiency of SPARQL systems by exploiting the emergent schema.

RDF systems store triple tables $T$ in multiple orders of Subject (S), Property (P) and Object (O), typically including $T_{PSO}$ (“column-wise”), $T_{SPO}$ (“row-wise”) and either $T_{OSP}$ or $T_{OPS}$ (“value-indexed”), or even all permutations. In our proposal, RDF system storage should become emergent schema aware by changing only the $T_{PSO}$ representation. Instead of a single $T_{PSO}$ triple table, it is stored as a set of wide relational tables in a column-store; we use MonetDB here. These tables represent only the regular triples; the remaining <5% of “exception” triples that do not fit the schema (or were updated recently) remain in a smaller PSO table $T_{pso}$. Thus, $T_{PSO}$ is replaced by the union of a smaller $T_{pso}$ table and a set of relational tables.

Footnote 2: To support named RDF graphs, the triples are usually extended to quads. Our approach trivially extends to that, but we discuss triple storage here for brevity.

Relational storage of triple data has been proposed before (e.g.
property tables [20]), though these prior approaches advocated an explicit and human-controlled mapping to a relational schema, rather than a transparent, adaptive and automatic approach, as we do. While such relational RDF approaches have performance advantages, they remained vulnerable when SPARQL queries do not consist mainly of star patterns, and in particular when they have triple patterns where the P is a variable. This would mean that many, if not all, relational tables could contribute to a query result, leading to huge generated SQL queries which bring the underlying SQL technology to its knees.

Our proposal hides relational storage behind $T_{PSO}$, which has the advantage that SPARQL query execution can always fall back on existing mechanisms, typically MergeJoins between scans of $T_{SPO}$, $T_{PSO}$ and $T_{OPS}$. Our approach, at no loss of flexibility, makes $T_{PSO}$ storage more compact, as we will discuss here, and creates opportunities for better handling of star patterns, both in query optimization and query execution, as discussed in the following sections.

Formal Definition. Given the RDF triple dataset $\Delta = \{t \mid t = (t_S, t_P, t_O)\}$, an emergent schema $(\Delta, E, \mu)$ specifies the set $E$ of emergent tables $T_k$, and a mapping $\mu$ from triples in $\Delta$ to emergent tables in $E$.

A common idea we apply is, rather than storing URIs as some kind of string, to represent them as an OID (object identifier), in practice as a large 64-bit integer. The RDF system maintains a dictionary $D: \mathrm{OID} \rightarrow \mathrm{URI}$ elsewhere. We use this dictionary $D$ creatively, adapting it to the emergent schema.

Definition 1. Emergent tables ($E = \{T_1, \ldots\}$): Let $s, p_1, p_2, \ldots, p_n$ be subject and properties with associated data types OID and $D_1, D_2, \ldots, D_n$; then $T_k = (T_k.s{:}\mathrm{OID},\ T_k.p_1{:}D_1,\ T_k.p_2{:}D_2,\ \ldots,\ T_k.p_n{:}D_n)$ is an emergent table where $T_k.p_j$ is a column corresponding to the property $p_j$ and $T_k.s$ is the subject column.

Definition 2. Dense subject columns: $T_k.s$ consists of densely ascending numeric values $\beta_k$, ..
$\beta_k + |T_k| - 1$, so $s$ is something like an array index, and we denote $T_k[s].p$ as the cell of row $s$ and column $p$. For each $T_k$ its base OID is $\beta_k = k \cdot 2^{40}$. By choosing the $\beta_k$ to be sufficiently far apart, in practice the values of columns $T_i.s$ and $T_j.s$ never overlap when $i \neq j$.

Definition 3. Triple-Table mapping ($\mu: \Delta \rightarrow E$): For each table cell $T_k[s].p_j$ with non-NULL value $o$, $\exists (s, p_j, o) \in \Delta$ and $\mu(s, p_j, o) = T_k$. We call these triples “regular” triples. All other triples $t \in \Delta$ are called “exception” triples, with $\mu(t) = T_{pso}$. In fact, $T_{pso}$ is exactly the collection of these exception triples.

The emergent schema detection algorithm [15] assigns each subject to at most one emergent table; our storage exploits this by manipulating the URI dictionary $D$ so that it gives dense numbers to all subjects $s$ assigned to the same $T_k$.

Footnote 3: In our current implementation with 64-bit OIDs we can thus support up to $2^{16}$ emergent tables, each with up to $2^{40} \approx 1$ trillion subjects, still leaving the highest 8 bits free, which are used for type information (see footnote 4).
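To illustrate Definitions 1 to 3, the following Python sketch models emergent tables with dense subject OIDs and a separate exception table. It is a toy model under stated assumptions, not the MonetDB-based implementation: the names `EmergentTable`, `Store`, `base_oid`, `table_of` and `SUBJECT_BITS` are hypothetical, each table is assumed to reserve $2^{40}$ consecutive OIDs (roughly the 1 trillion subjects per table mentioned in the text), and the type bits of an OID are assumed to be zero.

```python
from dataclasses import dataclass, field

# Assumption: each emergent table reserves 2**40 consecutive subject OIDs.
SUBJECT_BITS = 40


def base_oid(k):
    """Base OID beta_k of emergent table T_k."""
    return k << SUBJECT_BITS


def table_of(oid):
    """Table number k encoded in a subject OID (type bits assumed zero)."""
    return oid >> SUBJECT_BITS


@dataclass
class EmergentTable:
    k: int
    columns: dict  # property name -> list of cell values, None meaning NULL

    def cell(self, s, p):
        """T_k[s].p: the subject OID minus the base OID is the row index."""
        return self.columns[p][s - base_oid(self.k)]


@dataclass
class Store:
    tables: dict                                # k -> EmergentTable
    t_pso: list = field(default_factory=list)   # exception triples (p, s, o)

    def objects(self, s, p):
        """All o such that (s, p, o) is in the dataset, regular or exception."""
        result = []
        table = self.tables.get(table_of(s))
        if table is not None and p in table.columns:
            row = s - base_oid(table.k)
            if 0 <= row < len(table.columns[p]):
                value = table.columns[p][row]
                if value is not None:
                    result.append(value)  # regular triple stored in a cell
        result.extend(o for (ep, es, o) in self.t_pso if (ep, es) == (p, s))
        return result


# Hypothetical data: T_1 holds two books; a second author does not fit the
# single-valued column and is kept as an exception triple in T_pso.
book = EmergentTable(k=1, columns={"isbn": ["111", "222"],
                                   "authoredBy": ["alice", None]})
s0, s1 = base_oid(1), base_oid(1) + 1
store = Store(tables={1: book}, t_pso=[("authoredBy", s0, "bob")])

print(book.cell(s1, "isbn"))
print(store.objects(s0, "authoredBy"))
```

The point of the dense numbering is visible in `cell`: locating a subject's row needs only a subtraction, while `objects` shows that a complete answer still combines the table cell with any matching exception triples.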